Hi! My name is Jose Antonio Montero-Tena and I am a PhD student in bioinformatics in the Department of Plant Breeding of JLU Giessen, Germany.
In my master thesis, that was recently presented, I conducted genomic analysis to define the RDMa-h2 that had been identified with SNP array in wheat chromosome 5B between 655700000 and 656600000 and is linked with increased root dry mass. BLAST revealed that the SNP allele pattern of this haplotype matched in the assemblies Lancer and Paragon and the website crop-haplotypes.com was checked to validate this observation.
Beyond this, I conducted the bioinformatic analysis in R to cluster bins of mummer alignments into haploblocks that you report in your fantastic article ‘A haplotype-led approach to increase the precision of wheat breeding’ for the comparison and chromosome of interest. My original aim was to narrow down the haploblock as much as the alignment resolution could allow so that the trait could be associated with the gene causing an increase in the root mass or that the recombination crossovers were identified. However, I discovered some inconsistencies that were due mainly to the assembly type that complicated my analysis.
I would like to give you some feedback about these inconsistencies and some suggestions that I thought could help overcoming these limitations. Furthermore, you can find everything about my master thesis work at https://github.com/jamonterotena/Haplotype-based-Pangenome-Wheat-Analysis.git . The R script and other resources alongside this presentation will be available at https://github.com/jamonterotena/Alignment-density-matters-when-assigning-bins-into-haploblocks.git
When zooming into a 10-Mbp spanning my target haploblock, between 655 and 665 Mbp in Chr5B of comparison Lancer-Paragon, the reference chromosome seemed to be poorly covered with alignments:
plot_aln_pid_and_length(data = aln_subset, xmin = zoom_start, xmax = zoom_end, ymin = 98.5, reference_name = reference_assembly, query_name = query_assembly , x_label_gap = 1000000, dot_size = 4)
This low coverage was not commonplace in this comparison throughout the whole chromosome or in all comparisons. When observing how the bins at different bin sizes had been assigned into haploblocks in this regions, I quickly realised that some sections inside this region with very few alignment, e.g. ~663-664Mbp, could have a negative effect for precise predictions:
plot_aln_and_bins(aln_subset = aln_subset, bin_size = 5.0e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "SNP haploblock RDMa", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "green", color_prediction_text = "darkgreen", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000)
## [1] "BINS AT 5-MBP BIN SIZE"
## bin perc_id_median cut_off block_no bin_start bin_end aln_number
## 1 6.55e+08 99.97763 <99.99 NO_BLOCK 6.50e+08 6.55e+08 37
## 2 6.60e+08 99.98986 <99.99 NO_BLOCK 6.55e+08 6.60e+08 24
## 3 6.65e+08 99.99159 >=99.99 1 6.60e+08 6.65e+08 18
## [1] "BLOCK SUMMARY AT 5-MBP BIN SIZE"
## bin_size comparison block_no block_start block_end
## 1 5-Mbp Lancer->Paragon 1 6.6e+08 6.65e+08
The haploblock is here assigned regardless of this ‘gap’. Although this wrong assignment could be overcome by simply reducing the bin size, at 1-Mbp we still face similar issues:
plot_aln_and_bins(aln_subset = aln_subset, bin_size = 1.0e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "SNP haploblock RDMa", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "green", color_prediction_text = "darkgreen", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000)
## [1] "BINS AT 1-MBP BIN SIZE"
## bin perc_id_median cut_off block_no bin_start bin_end aln_number
## 1 6.55e+08 99.92852 <99.99 NO_BLOCK 6.54e+08 6.55e+08 6
## 2 6.56e+08 99.97730 <99.99 NO_BLOCK 6.55e+08 6.56e+08 5
## 3 6.57e+08 99.99048 >=99.99 1 6.56e+08 6.57e+08 4
## 4 6.58e+08 99.98996 <99.99 NO_BLOCK 6.57e+08 6.58e+08 7
## 5 6.59e+08 99.74574 <99.99 NO_BLOCK 6.58e+08 6.59e+08 1
## 6 6.60e+08 99.99528 >=99.99 1 6.59e+08 6.60e+08 7
## 7 6.61e+08 99.99595 >=99.99 1 6.60e+08 6.61e+08 2
## 8 6.62e+08 99.99072 >=99.99 1 6.61e+08 6.62e+08 7
## 9 6.63e+08 99.99664 >=99.99 1 6.62e+08 6.63e+08 5
## 10 6.64e+08 98.71248 <99.99 NO_BLOCK 6.63e+08 6.64e+08 1
## 11 6.65e+08 99.97621 <99.99 NO_BLOCK 6.64e+08 6.65e+08 3
## [1] "BLOCK SUMMARY AT 1-MBP BIN SIZE"
## bin_size comparison block_no block_start block_end
## 1 1-Mbp Lancer->Paragon 1 6.56e+08 6.63e+08
These untrustworthy haploblocks happen because the current method only assigns bins to haploblocks if the median percid is not lower than the cutoff, but disregards the number of alignments located inside the bins. This can have a negative impact on haploblock predictions as illustrated in the previous plots since some bins contain segments with low density of aligments of low percid that do not cause the median percid to drop under the cutoff, although they represent a region that is large enough to be considered excluded from the haploblock prediction. Other similar situations can arise, for example haploblocks predicted upon one alignment (check figure 6 in my master thesis paper ‘Pangenome analysis of haplotypes associated with root dry mass in wheat’)
Low alignment coverage could mainly have to do with the comparison type and the chromosome. Comparisons between chromosome-level assemblies and scaffold-level assemblies have significantly lower alignment coverage, alignment length and number of alignments per bin than their counterpart comparisons between higher-quality chromosome- and chromosome-level assemblies.
image1 <- load.image("CLA vs SLA.png")
plot(image1, axes=FALSE)
This would explain the low alignment coverage in the plots from the comparison between Lancer, chromosome-level assembly, and Paragon, scaffold-level assembly. Also, alignment coverage is not homogeneously distributed between chromosomes, interestingly with Chr5B showing the lowest average coverage.
image2 <- load.image("chr 5B vs ALL.png")
plot(image2, axes=FALSE)
I would like to suggest you to consider applying modifications to the algorithm so that the haploblocks are not only assigned on the basis of their percentage of identity, but also on their alignment density, since this factor could be essential to allow comparisons between chromosome-level assemblies and scaffold-level assemblies, due to the significant differences in alignment quality properties between them. These changes could be done in many ways and I would like to propose some new functions that would include alignment number as a factor for haploblock assignment. .
If we have a look at the median number of alignments per bin, we get 19, 9 and 4 for bin sizes 5, 2.5 and 1 Mbp
binsize_list <- c(5.0e6, 2.5e6, 1.0e6)
names(binsize_list) <- c("bin size: 5-Mbp", "bin size: 2.5-Mbp", "bin size: 1-Mbp")
for (i in 1:3){
print("median aln number per bin considering the real distribution")
blocks_dataframe <- assign_blocks_mummer_edited2(aln_subset = aln_subset, aln_threshold = 18, bin_size = binsize_list[i], bin_start = 0, bin_end = max(aln_subset$re))
aln_density <- blocks_dataframe$aln_number
print(median(aln_density))
}
## [1] "median aln number per bin considering the real distribution"
## [1] 19
## [1] "median aln number per bin considering the real distribution"
## [1] 9
## [1] "median aln number per bin considering the real distribution"
## [1] 4
By considering only percid and bin size 5-Mbp, we get a haploblock at 660-665 Mbp, despite a ‘gap’ in the section ~663-664Mbp that counts with only 2 low-percid alignments
plot_aln_and_bins(aln_subset = aln_subset, bin_size = 5.0e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "SNP haploblock RDMa", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "green", color_prediction_text = "darkgreen", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000, prediction_text = FALSE)
## [1] "BINS AT 5-MBP BIN SIZE"
## bin perc_id_median cut_off block_no bin_start bin_end aln_number
## 1 6.55e+08 99.97763 <99.99 NO_BLOCK 6.50e+08 6.55e+08 37
## 2 6.60e+08 99.98986 <99.99 NO_BLOCK 6.55e+08 6.60e+08 24
## 3 6.65e+08 99.99159 >=99.99 1 6.60e+08 6.65e+08 18
## [1] "BLOCK SUMMARY AT 5-MBP BIN SIZE"
## bin_size comparison block_no block_start block_end
## 1 5-Mbp Lancer->Paragon 1 6.6e+08 6.65e+08
If we now use a threshold of 19, the median number of alignments in 5-Mbp bins, we get no haploblock in the region, since the number of alignments in the second bin is lower than 19
plot_aln_and_bins_edited2(aln_subset = aln_subset, aln_threshold = 19, bin_size = 5.0e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "target region", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "blue", color_prediction_text = "darkblue", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000, prediction_text = FALSE)
We can repeat the same procedure with the rest two bin sizes that you used in your work
Bin size 2.5-Mbp, only percid
plot_aln_and_bins(aln_subset = aln_subset, bin_size = 2.5e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "SNP haploblock RDMa", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "green", color_prediction_text = "darkgreen", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000, prediction_text = FALSE)
## [1] "BINS AT 2.5-MBP BIN SIZE"
## bin perc_id_median cut_off block_no bin_start bin_end aln_number
## 1 655000000 99.96227 <99.99 NO_BLOCK 652500000 655000000 13
## 2 657500000 99.98867 <99.99 NO_BLOCK 655000000 657500000 13
## 3 660000000 99.99211 >=99.99 1 657500000 660000000 11
## 4 662500000 99.99228 >=99.99 1 660000000 662500000 12
## 5 665000000 99.98336 <99.99 NO_BLOCK 662500000 665000000 6
## [1] "BLOCK SUMMARY AT 2.5-MBP BIN SIZE"
## bin_size comparison block_no block_start block_end
## 1 2.5-Mbp Lancer->Paragon 1 657500000 662500000
Bin size 2.5-Mbp, percid & aln number
plot_aln_and_bins_edited2(aln_subset = aln_subset, aln_threshold = 9, bin_size = 2.5e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "target region", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "blue", color_prediction_text = "darkblue", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000, prediction_text = FALSE)
Bin size 1-Mbp, only percid
plot_aln_and_bins(aln_subset = aln_subset, bin_size = 1.0e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "SNP haploblock RDMa", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "green", color_prediction_text = "darkgreen", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000, prediction_text = FALSE)
## [1] "BINS AT 1-MBP BIN SIZE"
## bin perc_id_median cut_off block_no bin_start bin_end aln_number
## 1 6.55e+08 99.92852 <99.99 NO_BLOCK 6.54e+08 6.55e+08 6
## 2 6.56e+08 99.97730 <99.99 NO_BLOCK 6.55e+08 6.56e+08 5
## 3 6.57e+08 99.99048 >=99.99 1 6.56e+08 6.57e+08 4
## 4 6.58e+08 99.98996 <99.99 NO_BLOCK 6.57e+08 6.58e+08 7
## 5 6.59e+08 99.74574 <99.99 NO_BLOCK 6.58e+08 6.59e+08 1
## 6 6.60e+08 99.99528 >=99.99 1 6.59e+08 6.60e+08 7
## 7 6.61e+08 99.99595 >=99.99 1 6.60e+08 6.61e+08 2
## 8 6.62e+08 99.99072 >=99.99 1 6.61e+08 6.62e+08 7
## 9 6.63e+08 99.99664 >=99.99 1 6.62e+08 6.63e+08 5
## 10 6.64e+08 98.71248 <99.99 NO_BLOCK 6.63e+08 6.64e+08 1
## 11 6.65e+08 99.97621 <99.99 NO_BLOCK 6.64e+08 6.65e+08 3
## [1] "BLOCK SUMMARY AT 1-MBP BIN SIZE"
## bin_size comparison block_no block_start block_end
## 1 1-Mbp Lancer->Paragon 1 6.56e+08 6.63e+08
Bin size 1-Mbp, percid & aln number
plot_aln_and_bins_edited2(aln_subset = aln_subset, aln_threshold = 4, bin_size = 1.0e06, zoom_start = zoom_start, zoom_end = zoom_end, highlighted_target = target, target_text = "target region", fill_target = "orange", color_target_text = "darkorange", fill_predictions = "blue", color_prediction_text = "darkblue", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 1000000, prediction_text = FALSE)
Now, let’s have a look across the whole chromosome 5B
Only percid
plot_aln_and_bins(aln_subset = aln_subset, bin_size = 5.0e06, zoom_start = 0, zoom_end = max(aln_subset$re), highlighted_target = target, target_text = NA, fill_target = "orange", color_target_text = "darkorange", fill_predictions = "green", color_prediction_text = "darkgreen", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 500000000, print_tables = FALSE, prediction_text = FALSE, aln_text = FALSE)
percid & aln number
plot_aln_and_bins_edited2(aln_subset = aln_subset, aln_threshold = 19, bin_size = 5.0e06, zoom_start = 0, zoom_end = max(aln_subset$re), highlighted_target = NA, fill_predictions = "blue", color_prediction_text = "darkblue", ymin = 98.5, cut_off = 99.99, reference_name = reference_assembly, query_name = query_assembly, dot_size = 4, x_label_gap = 500000000, print_tables = FALSE, prediction_text = FALSE, aln_text = FALSE, vline = FALSE)
Some sections will no longer be assigned because they are not enough covered with alignments.
Notice that I applied a restrictive criteria by using the median number of alignments. This is because comparisons between chromosome- and scaffold-level assemblies, as this one between Lancer and Paragon, have lower alignment coverage than their counterparts, so probably high thresholds should be used to allow to compared between them. Furthermore, the method could be expanded to, for example, show color ranges depending on the number of alignments across the x axis. The website could benefit from this new algorithm by, for example, giving information about how good alignment density is in each haploblock between pairwise comparisons.
If you would like to replicate these analyses or apply them on different databases do not forget that you can find these in the R script at https://github.com/jamonterotena/Alignment-density-matters-when-assigning-bins-into-haploblocks.git , containing all the functions required. The alignment databases that I utilised for this work were borrowed from your original article and can be found here https://opendata.earlham.ac.uk/wheat/under_license/toronto/Brinton_etal_2020-05-20-Haplotypes-for-wheat-breeding/nucmer/ . I am sure that you will find these resources useful and have lots of ideas about how to implement these into a potential update on your method.
I would really like to know what you think about my work. I also would like to tell you that I would be very proud to collaborate with you in the future.
Thanks for your attention!